Building reliable and efficient data transfer and processing pipelines
Authors
Abstract
Scientific distributed applications have an increasing need to process and move large amounts of data across wide area networks. Existing systems either closely couple computation and data movement, or they require substantial human involvement during the end-to-end process. We propose a framework that enables scientists to build reliable and efficient data transfer and processing pipelines. Our framework provides a universal interface to different data transfer protocols and storage systems. It has sophisticated flow control and recovers automatically from network, storage system, software, and hardware failures. We successfully used data pipelines to replicate and process three terabytes of the DPOSS astronomy image dataset and several terabytes of the WCER educational video dataset. In both cases, the entire process was performed without any human intervention, and the data pipeline recovered automatically from various failures.
Similar resources
Experiences building Globus Genomics: a next-generation sequencing analysis service using Galaxy, Globus, and Amazon Web Services
We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage (via the Globus file transfer system); specification, configuratio...
AN EFFICIENT METHOD FOR OPTIMUM PERFORMANCE-BASED SEISMIC DESIGN OF FUSED BUILDING STRUCTURES
A dual structural fused system consists of replaceable ductile elements (fuses) that sustain major seismic damage and leave the primary structure (PS) virtually undamaged. The seismic performance of a fused structural system is determined by the combined behavior of the individual PS and fuse components. In order to design a feasible and economic structural fuse concept, we need a procedure to ...
Building a Multi-Objective Model for Multi-Product Multi-Period Production Planning with Controllable Processing Times: A Real Case Problem
Model building is a fragile and complex process, especially in the context of real cases. Each real case problem has its own characteristics, with new concepts and conditions. A correct model should have some essential characteristics, such as being compatible with real conditions, being of sufficient accuracy, being logically traceable, etc. This paper discusses how to build an efficient mode...
ViennaNGS: A toolbox for building efficient next-generation sequencing analysis pipelines
Recent achievements in next-generation sequencing (NGS) technologies lead to a high demand for reusable software components to easily compile customized analysis workflows for big genomics data. We present ViennaNGS, an integrated collection of Perl modules focused on building efficient pipelines for NGS data processing. It comes with functionality for extracting and converting features from c...
Modeling Genome Data Processing Pipelines
In order to conduct analyses on genome data, different calculation steps have to be done in a specific order, which constitutes a genome data processing pipeline. A lot of research is still in progress to find faster and more reliable ways to do various analyses, so single steps or the whole sequence of the pipelines might be subject to change. A modular and flexible way to configure pip...
Journal title:
- Concurrency and Computation: Practice and Experience
Volume 18, Issue
Pages -
Publication date 2006